Code
#loading packages
library(DiagrammeR)Missing data occurs when there are missing values in a dataset. There are many reasons why this occurs. It can be intentional or unintentional and can be classified into the following three categories, otherwise known as missingness mechanisms (Mainzer et al. 2023):
Missing completely at random (MCAR) is the probability of missing data being completely independent of any other variables.
Missing at random (MAR) is the probability of missing data being related to the observed values.
Missing not at random (MNAR) is the probability of missing data being dependent on the missing and observed values.
Figure 1: Graphical Representation of Missingness Mechanisms (Schafer and Graham 2002)
(X are the completely observed variables. Y are the partly missing variables. Z is the component of the cause of missingness unrelated to X and Y. R is the missingness.)
Looking for patterns in the missing data can help us to determine which category they belong. These mechanisms are important in determining how to handle the missing data. MCAR would be the best case scenario but seldom occur. MAR and MNAR are more common.
The problem with ignoring any missing values is that it does not give a true representation of the dataset and can lead to bias when analyzing. This reduces the statistical power of the analysis (van_Ginkel et al. 2020). To enhance the quality of the research, the following should be followed: explicitly acknowledge missing data problems and the conditions under which they occur and employ principled methods to handle the missing data (Dong and Peng 2013).
There are three types of methods to deal with missing data, the likelihood and Bayesian method, weighting methods, or imputation methods (Cao et al. 2021). Missing data can also be handled by simply deleting.
Likelihood Bayesian method is when information from a previous predictive distribution is combined with evidence obtained in a sample to predict a value. It requires technical coding and advanced statistical knowledge.
The weighting method is a traditional approach when weights from available data are used to adjust for non-response in a survey. Inefficiency occurs when there are extreme weights or a need for many weights.
The imputation method is when an estimate from the original dataset is used to estimate the missing value. There are two types of imputation: single and multiple.
Listwise deletion is when the entire observation is removed from the dataset. Deleting missing data can lead to the loss of important information regarding your dataset and is therefore not recommended. In certain cases, when the amount of missing data is small and the type is MCAR, listwise deletion can be used. There usually won’t be bias but potentially important information may be lost.
T-tests and chi-square tests can be used to assess pairs of predictor variables to determine whether the groups’ means differ significantly. According to van_Ginkel et al. (2020), if significant, the null hypothesis is rejected, therefore, indicating that the missing values are not randomly scattered throughout the data. This implies that the missing data is MAR or MNAR. Conversely, if non-significant, this implies that the data cannot be MAR. This does not eliminate the possibility that it is not MNAR–other information about the population is needed to determine this.
Whenever missing data is categorized as MAR or MNAR, listwise deletion would be wasteful, and the analysis biased. Alternate methods of dealing with the missing data is recommended: either pairwise deletion or imputation.
Pairwise deletion is when only the missing variable of an observation is removed. It allows more data to be analyzed than listwise deletion but limits the ability to make inferences of the total sample. For this reason, it is recommended to use imputation to properly deal with missing data.
Imputation is the preferred method to handle missing data. It consists of replacing missing data with an estimate obtained from the original, available data. After imputation, there will be a full dataset to analyze. According to Wulff and Jeppesen (2017), 3-5 imputations are sufficient, and 10 imputations are more than enough. More recent research has indicated that to improve statistical power, the number of imputations created should be at least equal to the percent of missing data (5% equals 5 imputations, 10% equals 10 imputations, 20% equals 20 imputations, etc.) (Pedersen et al. 2017).
Single, or univariate, imputation is when only one estimate is used to replace the missing data. Methods of single imputation include using the mean, the last observation carried forward, and random imputation. The following is a brief explanation of each:
Using the mean to replace a missing value is a straight-forward process. The mean of the dataset is calculated, including the missing value. The mean is then multiplied by the number of observations in the study. Next, the known values are subtracted from the product, and this gives an estimate that can be used for any missing values. The problem with this method is that it reduces the variance which leads to a smaller confidence interval.
Last Observation Carried Forward (LOCF) is a technique of replacing a missing value in longitudinal studies with a previously observed value, the most recent value is carried forward (Streiner 2008). The problem with this method is that it assumes that the previous observed value is perpetual when in reality that most likely is not the case.
Random imputation is a method of randomly drawing an observation and using that observation for any of the missing values. The problem with this method is that it introduces additional variability.
These single imputation methods are flawed. They often result in underestimation of standard errors or too small p-values (Dong and Peng 2013), which can cause bias in the analysis. Therefore, multiple imputation is the superior method because it handles missing data better and provides less biased results.
Multiple, or multivariate, imputation is when various estimates are used to replace the missing data by creating multiple datasets from versions of the original dataset. It can be done by using a regression model, or a sequence of regression models, such as linear, logistic and Poison. A set of m plausible values are generated for each unobserved data point, resulting in M complete data sets (Dong and Peng 2013). The new values are randomly drawn from predictive distributions either through joint modeling (JM, which is not used much anymore) or fully conditional specification (FCS) (Wongkamthong and Akande 2023). It is then analyzed and the results are combined to obtain a single value for the missing data.
The purpose of multiple imputation is to create a pool of imputed data for analysis, but if the pooled results are lacking, then multiple imputation should not be done (Mainzer et al. 2023). Another reason not to use multiple imputation is if there are very few missing values; there may be no benefit in using it. Also worth noting is some statistical analyses software already have built-in features to deal with missing data.
Multiple imputation by chained methods, otherwise known as MICE, is the most common and preferred method of multiple imputation (Wulff and Jeppesen 2017). It provides a more reliable way to analyze data with missing values. For this reason, this paper will focus on the methodology and application of the MICE process.
#loading packages
library(DiagrammeR)Figure 2: Flowchart of the MICE-process based on procedures proposed by Rubin (Wulff and Jeppesen 2017)
DiagrammeR::grViz("digraph {
# initiate graph
graph [layout = dot, rankdir = LR, label = 'The MICE-Process\n\n',labelloc = t, fontcolor = DarkSlateBlue, fontsize = 45]
# global node settings
node [shape = rectangle, style = filled, fillcolor = AliceBlue, fontcolor = DarkSlateBlue, fontsize = 35]
bgcolor = none
# label nodes
incomplete [label = 'Incomplete data set']
imputed1 [label = 'Imputed \n data set 1']
estimates1 [label = 'Estimates from \n analysis 1']
rubin [label = 'Rubin rules', shape = diamond]
combined [label = 'Combined results']
imputed2 [label = 'Imputed \n data set 2']
estimates2 [label = 'Estimates from \n analysis 2']
imputedm [label = 'Imputed \n data set m']
estimatesm [label = 'Estimates from \n anaalysis m']
# edge definitions with the node IDs
incomplete -> imputed1 [arrowhead = vee, color = DarkSlateBlue]
imputed1 -> estimates1 [arrowhead = vee, color = DarkSlateBlue]
estimates1 -> rubin [arrowhead = vee, color = DarkSlateBlue]
incomplete -> imputed2 [arrowhead = vee, color = DarkSlateBlue]
imputed2 -> estimates2 [arrowhead = vee, color = DarkSlateBlue]
estimates2-> rubin [arrowhead = vee, color = DarkSlateBlue]
incomplete -> imputedm [arrowhead = vee, color = DarkSlateBlue]
imputedm -> estimatesm [arrowhead = vee, color = DarkSlateBlue]
estimatesm -> rubin [arrowhead = vee, color = DarkSlateBlue]
rubin -> combined [arrowhead = vee, color = DarkSlateBlue]
}")*Rubin’s Rules: Average the estimates across m estimates. Calculate the standard errors and variance of m estimates. Combine using an adjustment term (1+1/m).
There are other methods of imputation worth noting and are briefly described below.
Regression Imputation is based on a linear regression model. Missing data is randomly drawn from a conditional distribution when variables are continuous and from a logistic regression model when they are categorical (van_Ginkel et al. 2020).
Predictive Mean Matching is also based on a linear regression model. The approach is the same as regression imputation when it comes to categorical missing values but different for continuous variables. Instead of random draws from a conditional distribution, missing values are based on predicted values of the outcome variable (van_Ginkel et al. 2020).
Hot Deck (HD) imputation is when a missing value is replaced by an observed response of a similar unit, also known as the donor. It can be either random or deterministic, which is based on a metric or value (Thongsri and Samart 2022). It does not rely on model fitting.
Stochastic Regression (SR) Imputation is an extension of regression imputation. The process is the same but a residual term from the normal distribution of the regression of the predictor outcome is added to the imputed value (Thongsri and Samart 2022). This maintains the variability of the data.
Random Forest (RF) Imputation is based on machine learning algorithms. Missing values are first replaced with the mean or mode of that particular variable and then the dataset is split into a training set and a prediction set (Thongsri and Samart 2022). The missing values are then replaced with predictions from these sets. This type of imputation can be used on continuous or categorical variables with complex interactions.
According to Rubin’s Rule, in multiple imputation m imputed values are created for each of the missing data and result in M complete datasets. For each of the M datasets, an estimate of \(\theta\) is acquired. Let \({\hat{\theta}}_{m}\) and \({\hat{\phi}}_{m}\) be an estimator of the variance of \({\hat{\theta}}_{m}\) based on the Mth complete dataset (Arnab 2017).
The combined estimator of \(\theta\) is given by:
\[{\hat{\theta}}_{M} = \displaystyle \frac{1}{M}\sum_{m = 1}^{M} {\hat{\theta}}_{m}\]
The proposed variance estimator of \({\hat{\theta}}_{M}\) is given by:
\[{\hat{\Phi}}_{M} = {\overline{\phi}}_{M} + (1+\displaystyle \frac{1}{M})B_{M}\] with a correction factor of \((1+\displaystyle \frac{1}{M})\),
where the average within imputation variance is:
\[{\overline{\phi}}_{M} = \displaystyle \frac{1}{M}\sum_{m = 1}^{M}{\hat{\phi}}_m\]
and the between imputation variance is:
\[B_{M} = \displaystyle \frac{1}{M-1}\sum_{m = 1}^{M}({\hat{\theta}}_{m}-{\overline{\theta}}_{M})^{2}\].
All multiple imputation methods have the following assumptions:
Observed data follow a multivariate normal distribution.
Missing data are classified as MAR, which is the probability that a missing value depends only on observed values and not unobserved values.
The parameters \({\theta}\) of the data model and the parameters \({\phi}\) of the model for the missing values are distinct. That is, knowing the values of \({\theta}\) does not provide any information about \({\phi}\).
These assumptions are essential otherwise we wouldn’t be able to predict a plausible replacement value for the missing data.
The chained equation process has the following steps (Azur et al. 2011):
Step 1: Using simple imputation, replace the missing data with this value, referred to as the “place holder”.
Step 2: The “place holder” values for one variable are set back to missing.
Step 3: The observed values from this variable (dependent variable) are regressed on the other variables (independent variables) in the model, using the same assumptions when performing linear, logistic, or Poison regression.
Step 4: The missing values are replaced with predictions “m” from this newly created model.
Step 5: Repeat Steps 2-4 for each variable that have missing values until all missing values have been replaced.
Step 6: Repeat Steps 2-4, updating imputations each cycle for as many “M” cycles/imputations that are required.
# load libraries
library(gtsummary)
library(gt)
library(ggplot2)
library(VIM, warn.conflicts=FALSE)
library(dplyr, warn.conflicts=FALSE)
library(mice, warn.conflicts=FALSE)
#library(jtools)
library(tidyverse)
library(modelsummary)
# load data
original = read.csv("kc_house_data.csv")kc_house_data.csv
This dataset contains the sale prices of 21,613 houses in King County, Seattle between May 2014 to May 2015. The original dataset contained 21 columns with various selling attributes. For the purpose of this project, we have condensed the variables to the following 4: sales price, number of bedrooms, number of bathrooms, and square footage of living space.
| Variable | Type | Description |
|---|---|---|
| price | numeric | Sale price of homes sold between May 2014 and May 2015 in King County, Seattle; range from $75,000 to $7,700,000 |
| bedrooms | integer | Number of bedrooms in homes sold between May 2014 and May 2015 in King County, Seattle; range from 0 to 33. |
| bathrooms | numeric | Number of bathrooms in homes sold between May 2014 and May 2015 in King County, Seattle; range from 0 to 8. |
| sqft_living | integer | Number of square feet of living space in homes sold between May 2014 and May 2015 in King County, Seattle; range from 290 sf to 13,540 sf. |
# table summary of data
original %>%
tbl_summary(missing_text = "NA") %>%
add_n() %>%
modify_header(label ~ "**Variable**") %>%
modify_caption("**Summary of Data**") %>%
bold_labels() %>%
as_gt() %>%
tab_options(column_labels.background.color = "#FFFFFF00",
table.background.color = "#FFFFFF00")| Variable | N | N = 21,6131 |
|---|---|---|
| price | 21,613 | 450,000 (321,950, 645,000) |
| bedrooms | 21,613 | 3 (3, 4) |
| bathrooms | 21,613 | 2.25 (1.75, 2.50) |
| sqft_living | 21,613 | 1,910 (1,427, 2,550) |
| 1 Median (IQR) | ||
# plot the data
ggplot(original, aes(x=sqft_living, y=price)) +
geom_point(alpha=0.5, color='darkblue') +
labs(x="sqft_living", y="price")Using the MICE (Multivariate Imputation by Chained Equations) package in R, a statistical programming software, we will be demonstrating how to impute missing data in the house sales dataset. We will run through the MICE process and provide results of the imputations. We will then check the accuracy of the imputations by comparing these results to the results of the original, complete data set.
First, we evaluate the dataset for missing values by using the is.na() function:
# check for missing data
sapply(original, function(x) sum(is.na(x))) price bedrooms bathrooms sqft_living
0 0 0 0
There is no missing data; this is the original, complete dataset. We will artificially create some missing values. We do this by first duplicating the dataset. We want to preserve the original so that we can later check the accuracy of the imputation process. Next, we randomly replace the desired amount of values with NA to mimic missing data. We will assign 200 NA values to the following variables: bedrooms, bathrooms, and sqft_living and 100 NA values to the price variable.
# duplicate dataset
house <- original
# replace some values with NA to create missing data
set.seed(10)
house[sample(1:nrow(house), 200), "sqft_living"] <- NA
house[sample(1:nrow(house), 200), "bedrooms"] <- NA
house[sample(1:nrow(house), 200), "bathrooms"] <- NA
house[sample(1:nrow(house), 100), "price"] <- NAUsing the is.na() function, we can check to confirm we have created missing data:
# check for missing data
sapply(house, function(x) sum(is.na(x))) price bedrooms bathrooms sqft_living
100 200 200 200
Our dataset now has 700 NA/missing values. This equates to about 3% of the data (700/21,613).
Next, we need to check the missingness by looking for patterns in the dataset by using the md.pattern() function. Of course our missing data was randomly created, so we know we have MCAR.
# check the missingness patterns
md.pattern(house, rotate.names = TRUE) price bedrooms bathrooms sqft_living
20920 1 1 1 1 0
195 1 1 1 0 1
196 1 1 0 1 1
2 1 1 0 0 2
197 1 0 1 1 1
2 1 0 1 0 2
1 1 0 0 1 2
98 0 1 1 1 1
1 0 1 1 0 2
1 0 1 0 1 2
100 200 200 200 700
Blue are the observed values, and red are the missing values. There are 20,920 complete observations (not missing data). The total number of missing data for each variable is as follows: price = 100, bedrooms = 200, bathrooms = 200, sqft_living = 200, which equal the 700 total missing values. This visual gives a breakdown of where the missing data occurs.
Another way to look for patterns in the missing data is to use the aggr() function:
# histogram, patterns and count for missing data
aggr_plot <- aggr(house, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(data), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
Variables sorted by number of missings:
Variable Count
bedrooms 0.009253690
bathrooms 0.009253690
sqft_living 0.009253690
price 0.004626845
It shows each variable and the percent of missing data for each (as a decimal):
bedrooms: 0.9%
bathrooms: 0.9%
sqft_living: 0.9%
price: 0.5%
Another visualization to check for the missingness is a margin plot:
# margin plot for missingness
marginplot(house[c(1,4)])The red box plot on the left shows the distribution of sqft_living with price missing. The blue box plot shows the distribution of the remaining data points. The same is true for the price box plots on the bottom. We expect the red and blue boxplots to be very similar if our assumptions of MCAR data are true. In this case, they are.
Since we are missing about 3% of our data, we need to perform at least 3 imputations. This can be done by using the mice() function. Since 5 is the default, we will use that; if you need to use something different, you must set m equal to the number of imputations that you desire. The set.seed will be given the value 1337 (any number can be used here) to retrieve the same results each time the multiple imputation is performed.
# impute the data 5 times (default)
imp = mice(data = house, set.seed = 1337)
iter imp variable
1 1 price bedrooms bathrooms sqft_living
1 2 price bedrooms bathrooms sqft_living
1 3 price bedrooms bathrooms sqft_living
1 4 price bedrooms bathrooms sqft_living
1 5 price bedrooms bathrooms sqft_living
2 1 price bedrooms bathrooms sqft_living
2 2 price bedrooms bathrooms sqft_living
2 3 price bedrooms bathrooms sqft_living
2 4 price bedrooms bathrooms sqft_living
2 5 price bedrooms bathrooms sqft_living
3 1 price bedrooms bathrooms sqft_living
3 2 price bedrooms bathrooms sqft_living
3 3 price bedrooms bathrooms sqft_living
3 4 price bedrooms bathrooms sqft_living
3 5 price bedrooms bathrooms sqft_living
4 1 price bedrooms bathrooms sqft_living
4 2 price bedrooms bathrooms sqft_living
4 3 price bedrooms bathrooms sqft_living
4 4 price bedrooms bathrooms sqft_living
4 5 price bedrooms bathrooms sqft_living
5 1 price bedrooms bathrooms sqft_living
5 2 price bedrooms bathrooms sqft_living
5 3 price bedrooms bathrooms sqft_living
5 4 price bedrooms bathrooms sqft_living
5 5 price bedrooms bathrooms sqft_living
We need to create the imputed dataset using the complete() function:
# create imputed dataset
imputed <- complete(imp)We now need to check the imputed dataset to make sure there is no missing data. Again, we use the is.na() function:
# check imputed dataset for missing values
sapply(imputed, function(x) sum(is.na(x))) price bedrooms bathrooms sqft_living
0 0 0 0
We can check the quality of the imputations by running a strip plot, which is a single axis scatter plot. It will show the distribution of the imputed values over the observed values. The blue values are the observed values and the red values are the imputed values. We want the imputations to be values that could have been observed had the data not been missing.
# stip plots to check the quality of the imputations
par(mfrow=c(2,2))
stripplot(imp, price, pch = 19, cex=1, xlab = "Imputation number")stripplot(imp, bedrooms, pch = 19, cex=1, xlab = "Imputation number")stripplot(imp, bathrooms, pch = 19, cex=1, xlab = "Imputation number")stripplot(imp, sqft_living, pch = 19, cex=1, xlab = "Imputation number")Next, we will pool the results of the complete dataset with the imputed dataset to arrive at estimates that will properly account for the missing data. We fit the complete model with the with() function and display the summary of the pooled results. It will give us the estimate, standard error, test statistic, degrees of freedom, and the p-value for each variable.
# fit complete-data model
fit <- with(imp, lm(price~bedrooms+bathrooms+sqft_living))
# pool and summarize the results
summary(pool(fit)) term estimate std.error statistic df p.value
1 (Intercept) 75043.7775 6952.563287 10.793685 14431.278 4.677205e-27
2 bedrooms -57930.4966 2357.322483 -24.574702 7683.745 1.936143e-128
3 bathrooms 7624.2182 3531.505441 2.158914 11516.540 3.087739e-02
4 sqft_living 309.7992 3.114475 99.470779 8431.335 0.000000e+00
We now have a full, complete dataset that we can analyze!
We will now compare the new, imputed dataset to our original, complete dataset to check for accuracy of the imputation method by chained equations:
# fit original, complete dataset
og_lm = lm(price~bedrooms+bathrooms+sqft_living, data = original)
# compare imputed dataset to the original dataset
msummary(list("Imputed" = pool(fit), "Original" = og_lm), title = "Comparison", statistic = "p.value", estimate = "estimate", gof_omit = 'IC|Log|Adj|F|RMSE')| Imputed | Original | |
|---|---|---|
| (Intercept) | 75043.777 | 74662.099 |
| (<0.001) | (<0.001) | |
| bedrooms | −57930.497 | −57906.631 |
| (<0.001) | (<0.001) | |
| bathrooms | 7624.218 | 7928.708 |
| (0.031) | (0.024) | |
| sqft_living | 309.799 | 309.605 |
| (<0.001) | (<0.001) | |
| Num.Obs. | 21613 | 21613 |
| Num.Imp. | 5 | |
| R2 | 0.507 | 0.507 |
The estimates for the imputed dataset are very close to the estimates for the original dataset. They are almost exact, with the exception of bathrooms which is off just slightly. The p-values are all very similar; the differences are very minimal. They are all significant in the imputed dataset, as well as the original dataset. These results indicate a high accuracy of the imputation process. We can conclude that multiple imputation by chained equation is a reliable source to impute missing data.
In conclusion, missing data can occur in research for a variety of reasons. It is never a good idea to ignore it. Doing this will lead to biased estimates of parameters, loss of information, decreased statistical power, and weak reliability of findings (Dong and Peng 2013). The best course of action is to impute the missing data by using multiple imputation. When missing data is discovered, it is important to first identify it and look for missing data patterns. Next, define the variables in the dataset that are related to the missing values that will be used for imputation. Create the necessary number of complete data sets. Run the models and combine them using the imputed values, and finally, analyze the complete dataset. Performing these steps will minimize the adverse effects caused by missing data on the analysis (Pampka, Hutcheson, and Williams 2016).